{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ARIMA\n",
    "\n",
    "An [AutoRegressive Integrated Moving Average](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average) model is a popular model used in time series analysis to understand the data or forecast future points.\n",
    "\n",
    "This implementation can fit a model to each time series in a batch and perform in-sample predictions and out-of-sample forecasts. It is designed to give the best performance when working on a large batch of time series.\n",
    "\n",
    "Useful links:\n",
    "\n",
    "- cuDF documentation: https://docs.rapids.ai/api/cudf/stable\n",
    "- cuML's ARIMA API docs: https://docs.rapids.ai/api/cuml/stable/api.html#arima\n",
    "- a good introduction to ARIMA: https://otexts.com/fpp2/arima.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf\n",
    "from cuml.tsa.arima import ARIMA\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import random"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading util\n",
    "\n",
    "The data for this demo is stored in a simple CSV format:\n",
    "- the data series are stored in columns\n",
    "- the first column contains the date of the data points\n",
    "- the first row contains the name of each variable\n",
    "\n",
    "For example, let's check the *population estimate* dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat data/time_series/population_estimate.csv | head"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define a helper function to load a dataset with a given name and return a GPU dataframe. We discard the date, and limit the batch size for convenience."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_dataset(name, max_batch=4):\n",
    "    import os\n",
    "    pdf = pd.read_csv(os.path.join(\"data\", \"time_series\", \"%s.csv\" % name))\n",
    "    return cudf.from_pandas(pdf[pdf.columns[1:max_batch+1]].astype(np.float64))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualization util\n",
    "\n",
    "We define a helper function that displays the data, and optionally a prediction starting from a given index. Each time series is plot separately for better readability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def visualize(y, pred=None, pred_start=None, lower=None, upper=None):\n",
    "    n_obs, batch_size = y.shape\n",
    "    col = [\"#1f77b4\", \"#ff7f0e\"]\n",
    "\n",
    "    # Create the subplots\n",
    "    c = min(batch_size, 2)\n",
    "    r = (batch_size + c - 1) // c\n",
    "    fig, ax = plt.subplots(r, c, squeeze=False)\n",
    "    ax = ax.flatten()\n",
    "    \n",
    "    # Range for the prediction\n",
    "    if pred is not None:\n",
    "        pred_start = n_obs if pred_start is None else pred_start\n",
    "        pred_end = pred_start + pred.shape[0]\n",
    "    else:\n",
    "        pred_end = n_obs\n",
    "    \n",
    "    # Plot the data\n",
    "    for i in range(batch_size):\n",
    "        title = y.columns[i]\n",
    "        if pred is not None:\n",
    "            ax[i].plot(np.r_[pred_start:pred_end],\n",
    "                       pred[pred.columns[i]].to_numpy(),\n",
    "                       linestyle=\"--\", color=col[1])\n",
    "        # Prediction intervals\n",
    "        if lower is not None and upper is not None:\n",
    "            ax[i].fill_between(np.r_[pred_start:pred_end],\n",
    "                               lower[lower.columns[i]].to_numpy(),\n",
    "                               upper[upper.columns[i]].to_numpy(),\n",
    "                               alpha=0.2, color=col[1])\n",
    "        ax[i].plot(np.r_[:n_obs], y[title].to_numpy(), color=col[0])\n",
    "        ax[i].title.set_text(title)\n",
    "        ax[i].set_xlim((0, pred_end))\n",
    "    for i in range(batch_size, r*c):\n",
    "        fig.delaxes(ax[i])\n",
    "    fig.tight_layout()\n",
    "    fig.patch.set_facecolor('white')\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Non-seasonal ARIMA models\n",
    "\n",
    "A basic `ARIMA(p,d,q)` model is made of three components:\n",
    " - An **Integrated** (I) component: the series is differenced `d` times until it is stationary\n",
    " - An **AutoRegressive** (AR) component: the variable is regressed on its `p` past values\n",
    " - A **Moving Average** (MA) component: the variable is regressed on `q` past error terms\n",
    "\n",
    "The model can also incorporate an optional constant term (called *intercept*).\n",
    "\n",
    "### A simple MA(2) example\n",
    "\n",
    "We start with a simple Moving Average model. Let's first load and visualize the *migrations in Auckland by age* dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_mig = load_dataset(\"net_migrations_auckland_by_age\", 4)\n",
    "visualize(df_mig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to fit the model with `q`=2 and with an intercept.\n",
    "The `ARIMA` class accepts cuDF dataframes or array-like types as input (host or device), e.g numpy arrays. Here we already have a dataframe so we can simply pass it to the `ARIMA` constructor with the model parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_mig = ARIMA(df_mig, order=(0,0,2), fit_intercept=True)\n",
    "model_mig.fit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now forecast and visualize the results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fc_mig = model_mig.forecast(10)\n",
    "visualize(df_mig, fc_mig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to get the parameters that were fitted to the model, we can use `get_fit_params` or the corresponding properties. The parameters are organized in 2D arrays: one row represents one parameter and the columns are different batch members."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model_mig.get_fit_params()[\"ma\"])\n",
    "print(model_mig.ma_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is also possible to get a compact numpy array containing all the parameters with `pack`, or similarly to load the parameters into a model with `unpack`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model_mig.pack())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also get the log-likelihood of the parameters w.r.t to the series, and evaluate various information criteria:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"log-likelihood:\\n\", model_mig.llf)\n",
    "print(\"\\nAkaike Information Criterion (AIC):\\n\", model_mig.aic)\n",
    "print(\"\\nCorrected Akaike Information Criterion (AICc):\\n\", model_mig.aicc)\n",
    "print(\"\\nBayesian Information Criterion (BIC):\\n\", model_mig.bic)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An ARIMA(1,2,1) example\n",
    "\n",
    "Let's now load the *population estimate* dataset. For this dataset a first difference is not enough to make the data stationary because of the quadratic trend, so we decide to go with `d`=2.\n",
    "\n",
    "This time we won't simply forecast but also predict in-sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_pop = load_dataset(\"population_estimate\")\n",
    "\n",
    "# Fit an ARIMA(1,2,1) model\n",
    "model_pop = ARIMA(df_pop, order=(1,2,1), fit_intercept=True)\n",
    "model_pop.fit()\n",
    "\n",
    "# Predict in-sample and forecast out-of-sample\n",
    "pred_pop = model_pop.predict(80, 160)\n",
    "visualize(df_pop, pred_pop, 80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Confidence intervals\n",
    "\n",
    "To get confidence intervals when forecasting, we can specify the confidence level (here 95%):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fc_pop, lower_pop, upper_pop = model_pop.forecast(23, level=0.95)\n",
    "visualize(df_pop, fc_pop, lower=lower_pop, upper=upper_pop)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Seasonal ARIMA models\n",
    "\n",
    "[Seasonal ARIMA models](https://otexts.com/fpp2/seasonal-arima.html) are expressed in the form `ARIMA(p,d,q)(P,D,Q)s` and have additional seasonal components that we denote SAR and SMA.\n",
    "\n",
    "We can also choose to apply a first or second seasonal difference, or combine a non-seasonal and a seasonal difference (note: `p+P <= 2` is required).\n",
    "\n",
    "### An ARIMA(1,1,1)(1,1,1)12 example\n",
    "\n",
    "We load the *guest nights by region* dataset. This dataset shows a strong seasonal component with a period of 12 (annual cycle, monthly data), and also a non-seasonal trend. A good choice is to go with `d`=1, `D`=1 and `s`=12.\n",
    "\n",
    "We create the model with seasonal parameters, and forecast:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_guests = load_dataset(\"guest_nights_by_region\", 4)\n",
    "\n",
    "# Create and fit an ARIMA(1,1,1)(1,1,1)12 model:\n",
    "model_guests = ARIMA(df_guests, order=(1,1,1), seasonal_order=(1,1,1,12),\n",
    "                     fit_intercept=False)\n",
    "model_guests.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Forecast\n",
    "fc_guests = model_guests.forecast(40)\n",
    "\n",
    "# Visualize after the time step 200\n",
    "visualize(df_guests[200:], fc_guests)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Missing observations\n",
    "\n",
    "ARIMA supports missing observations in the data. You can also pad your dataset at the start in order to batch computations even if the series have different lengths.\n",
    "\n",
    "To illustrate that, let's create a fake dataset from the seasonal dataset used above. We will simulate series of different lengths and add missing observations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cut dataset to 100 observations\n",
    "df_guests_missing = df_guests[:100].copy()\n",
    "\n",
    "for title in df_guests_missing.columns:\n",
    "    # Missing observations at the start to simulate varying lengths\n",
    "    n_leading = random.randint(5, 40)\n",
    "    df_guests_missing[title][:n_leading]=None\n",
    "    \n",
    "    # Random missing observations in the middle\n",
    "    missing_obs = random.choices(range(n_leading, 100), k=random.randint(5, 20))\n",
    "    df_guests_missing[title][missing_obs]=None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that missing observations need to be represented by the value `NaN` to convert the dataset to a numeric array. `NA`s in dataframes can be filled with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_guests_missing = df_guests_missing.fillna(np.nan)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now fit a model. Here we will do in- and out-of-sample predictions, to showcase how this model can fill the gaps in data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create and fit an ARIMA(1,1,1)(1,1,1)12 model:\n",
    "model_guests_missing = ARIMA(df_guests_missing, order=(1,1,1), seasonal_order=(1,1,1,12),\n",
    "                             fit_intercept=False)\n",
    "model_guests_missing.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Forecast\n",
    "fc_guests_missing = model_guests_missing.predict(0, 120)\n",
    "\n",
    "visualize(df_guests_missing, fc_guests_missing, 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the model can't form predictions at the start where we padded with missing values. The first in-sample predictions will be equal to a constant value (0 in the absence of intercept)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exogenous variables\n",
    "\n",
    "ARIMA also supports exogenous variables. As this model works with a batch, there are some limitations: each column of `endog` corresponds to a fixed number of columns of `exog`. Exogenous variables can't be shared by different series (they have to be duplicated if necessary), and all series must have the same number of exogenous variables. The shape of `exog` must be `(n_obs, batch_size * n_exog)`, and columns are grouped by corresponding batch id. For predictions and forecasts, values of the exogenous variables for future steps must be provided in an array of shape `(nsteps, batch_size * n_exog)`.\n",
    "\n",
    "Note that endogenous variables might contain missing observations but exogenous variables cannot.\n",
    "\n",
    "To illustrate this, we will again create a fake dataset from the one used above, adding some procedural variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nb = 4\n",
    "\n",
    "# Generate exogenous variables and coefficients\n",
    "get_sine = lambda n, period: \\\n",
    "    np.sin(np.r_[:n] * 2 * np.pi / period + np.random.uniform(0, period))\n",
    "np_exog = np.column_stack([get_sine(319, T)\n",
    "                           for T in np.random.uniform(20, 100, 2 * nb)])\n",
    "np_exog_coef = np.random.uniform(20, 200, 2 * nb)\n",
    "\n",
    "# Create dataframes for the past and future values\n",
    "df_exog = cudf.DataFrame(np_exog[:279])\n",
    "df_exog_fut = cudf.DataFrame(np_exog[279:])\n",
    "\n",
    "# Add linear combination of the exogenous variables to the endogenous\n",
    "df_guests_exog = df_guests.copy()\n",
    "for ib in range(nb):\n",
    "    df_guests_exog[df_guests_exog.columns[ib]] += \\\n",
    "        np.matmul(np_exog[:279, ib*2:(ib+1)*2], np_exog_coef[ib*2:(ib+1)*2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create and fit an ARIMA(1,0,1)(1,1,1)12 (c) model with exogenous variables\n",
    "model_guests_exog = ARIMA(endog=df_guests_exog, exog=df_exog,\n",
    "                          order=(1,0,1), seasonal_order=(1,1,1,12),\n",
    "                          fit_intercept=True)\n",
    "model_guests_exog.fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Forecast\n",
    "fc_guests_exog = model_guests_exog.forecast(40, exog=df_exog_fut)\n",
    "\n",
    "# Visualize after the time step 100\n",
    "visualize(df_guests_exog[100:], fc_guests_exog)"
   ]
  }
 ],
 "metadata": {
  "file_extension": ".py",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "mimetype": "text/x-python",
  "name": "python",
  "npconvert_exporter": "python",
  "pygments_lexer": "ipython3",
  "version": 3
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
